Monitoring API: remove maximum retention days prometheus#2851
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
Hello @marioferh! Some important instructions when contributing to openshift/api: |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository YAML (base), Central YAML (inherited) Review profile: CHILL Plan: Enterprise Run ID: ⛔ Files ignored due to path filters (5)
📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThe PR converts Prometheus retention to an hours-based field in the API: the Go type Renamed Suggested reviewers
🚥 Pre-merge checks | ✅ 15✅ Passed checks (15 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.12.2)Error: build linters: unable to load custom analyzer "kubeapilinter": tools/_output/bin/kube-api-linter.so, plugin: not implemented Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
80e0c7d to
03a1de9
Compare
|
While this should be ok not having a tombstone as we're just loosening limits, the openapi.json diff includes unrelated changes (NetworkObservabilitySpec, registries descriptions, softirqs collector, KMS default) which are likely from other PRs that merged into master since the branch was created. Could you rebase on latest master so the openapi.json diff only shows the PR changes? This would avoid confusion during review and prevent accidentally shipping stale generated data. |
03a1de9 to
854e0ad
Compare
|
Hey @marioferh, as commented before we go with just removing the max constraints, I think we should also address the granularity issue that Ayoub and Simon raised. Right now durationInDays: int32 can't express sub-day retention like 15h or 24h (which is the default for UWM), and sizeInGiB: int32 can't express sub-GiB sizes like 500MiB. The CMO ConfigMap and the Prometheus Operator CRD both accept freeform Prometheus duration/size strings, so customers migrating from retention: "15h" would have no equivalent in the new CRD. I think we should push back on the integer-only convention for these two fields and switch to string fields with the same Prometheus pattern validation that the Prometheus Operator CRD uses: duration: string with pattern ^(0|(([0-9]+)y)?(([0-9]+)w)?(([0-9]+)d)?(([0-9]+)h)?(([0-9]+)m)?(([0-9]+)s)?(([0-9]+)ms)?)$ @everettraven I know we discussed avoiding duration strings back in #2653, but this comes from the Prometheus Operator CR itself, which validates retention with ^(0|(([0-9]+)y)?(([0-9]+)w)?(([0-9]+)d)?(([0-9]+)h)?(([0-9]+)m)?(([0-9]+)s)?(([0-9]+)ms)?)$ and size with (^0|([0-9]*[.])?[0-9]+((K|M|G|T|E|P)i?)?B)$ (https://github.com/openshift/cluster-monitoring-operator/blob/0ce8e4c0bce00381802c25d92ed238775139ee82/assets/monitoring-operator/00_0prometheus-custom-resource-definition.yaml). CMO passes these values straight through to the Prometheus CR, so using durationInDays: int32 and sizeInGiB: int32 introduces a lossy conversion layer that breaks existing configs like retention: "15h" or retentionSize: "500MiB" that customers can set today in the ConfigMap. Would you be open to using string fields with these same Prometheus patterns for this specific case? They're fully validatable at admission time via the regex, and since this is v1alpha1 behind a feature gate we can still change it. Worst case, as a fallback, we may set hours, but wanted to get back your thoughts on this! Thanks! |
| @@ -4152,9 +4150,7 @@ spec: | |||
| When the limit is reached, Prometheus will delete oldest data first. | |||
| When omitted, no size limit is enforced and Prometheus uses available PersistentVolume capacity. | |||
| Minimum value is 1 GiB. | |||
| Maximum value is 16384 GiB (16 TiB). | |||
| format: int32 | |||
| maximum: 16384 | |||
There was a problem hiding this comment.
We strongly recommend having reasonable upper bounds here. If nothing else, at least a maximum of 2,147,483,647 (the int32 maximum).
It is also worth noting that once you set a maximum and it GAs, we won't let you decrease it without strong evidence that it does not break user workflows.
It is usually easier for us to increase maximum constraints as needed than to add one or decrease it.
There was a problem hiding this comment.
Also noting here that 2,147,483,647 GiB is ~ 2047 PiB. That seems pretty high...
| Minimum value is 1 day. | ||
| Maximum value is 365 days (1 year). | ||
| format: int32 | ||
| maximum: 365 |
There was a problem hiding this comment.
Just a note, a maximum of 2,147,483,647 here is equivalent to ~58835 centuries. Seems like an unreasonably high number that would never be set.
Without more context, I would think it to be reasonable to increase this to something like 3650 to account for ~10 years of retention. That seems pretty high to me for metrics, but 🤷
We strongly encourage not using durations in favor of a unit-based integer, but if you folks feel strongly that this is the best UX for your end-users I'm not going to prevent you from doing so. I'm not the expert on your user base here, you folks are :). That being said, if the concern is solely that you cannot reasonably express smaller unit values that are likely to occur, you can change the unit you are using for the integer field to the smallest reasonable unit. For example, Even if this is in v1alpha1 we still have to abide by tombstoning rules if we change the serialization of the fields unless we want to move this to a v1alpha2. |
854e0ad to
71e6477
Compare
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@config/v1alpha1/types_cluster_monitoring.go`:
- Around line 2238-2247: The public JSON name was changed and will break
existing manifests; restore wire compatibility by keeping the original JSON
field name for the struct field DurationInHours (e.g. change the tag to
`json:"durationInDays,omitempty"` so existing persisted objects continue to
round-trip) and add a deprecation/kubebuilder comment on the field explaining it
represents hours (or add a new explicitly named field for hours in a new API
version/conversion path if you intend to support both); update the struct field
DurationInHours and its kubebuilder tags accordingly and ensure any API
docs/comments reflect the restored wire name and units.
- Around line 2243-2255: Remove the hard upper-bound metadata and docs for
retention and storage fields: delete the `+kubebuilder:validation:Maximum=87600`
and the "Maximum value is 87600 hours..." doc line for the DurationInHours field
(symbol: DurationInHours) and delete the
`+kubebuilder:validation:Maximum=2147483647` and the "Maximum value is
2147483647 GiB..." doc line for the sizeInGiB field (symbol: sizeInGiB); keep
the Minimum validations and the rest of the comments intact so the API no longer
advertises or enforces an upper bound.
In
`@payload-manifests/crds/0000_10_config-operator_01_clustermonitorings.crd.yaml`:
- Around line 4136-4147: The CRD change renamed the serialized retention field
in v1alpha1 from spec.prometheusConfig.retention.durationInDays to
durationInHours which will break existing stored objects and clients; revert the
rename in the v1alpha1 CRD so the field remains durationInDays, restore its
original description/format/minimum/maximum, and instead introduce any new
representation by adding a new API version (e.g., v1beta1) with durationInHours
plus proper conversion webhooks or conversion funcs and tombstoning logic to
translate between durationInDays and durationInHours during reads/writes.
- Around line 4143-4145: The CRD still enforces a 10-year ceiling via the schema
entry "maximum: 87600"; remove that maximum constraint so the API no longer
rejects larger retention values at admission time. Locate the clustermonitorings
CRD schema where "format: int32" and "maximum: 87600" appear (the
retention/hours field) and delete the "maximum: 87600" line (and any equivalent
numeric maximums on related retention fields) so the property is unconstrained
by the CRD schema. Leave the type/format (e.g., "type: integer" / "format:
int32") intact if needed, and run schema validation tests to ensure no other
admission-side maximums remain.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 003efd47-3c8c-4d49-aac1-a3b47834cee3
⛔ Files ignored due to path filters (5)
config/v1alpha1/zz_generated.crd-manifests/0000_10_config-operator_01_clustermonitorings.crd.yamlis excluded by!**/zz_generated.crd-manifests/*config/v1alpha1/zz_generated.featuregated-crd-manifests/clustermonitorings.config.openshift.io/ClusterMonitoringConfig.yamlis excluded by!**/zz_generated.featuregated-crd-manifests/**config/v1alpha1/zz_generated.swagger_doc_generated.gois excluded by!**/zz_generated*openapi/generated_openapi/zz_generated.openapi.gois excluded by!openapi/**,!**/zz_generated*openapi/openapi.jsonis excluded by!openapi/**
📒 Files selected for processing (2)
config/v1alpha1/types_cluster_monitoring.gopayload-manifests/crds/0000_10_config-operator_01_clustermonitorings.crd.yaml
|
@everettraven besides the override on the CI, would we need to tombstone this due to the filed removal? (change, technically, but anyway). Thanks! |
We should collect data via telemetry in the first place. |
Express Prometheus retention duration in hours under the existing durationInDays JSON field for wire compatibility. Remove operational maximums (365 days, 16384 GiB, and the interim 87600-hour cap); keep sizeInGiB bounded by the int32 maximum at admission time. Signed-off-by: Mario Fernandez <mariofer@redhat.com> Co-authored-by: Cursor <cursoragent@cursor.com>
What was proposed is to adopt duration format already supported by the upstream Prometheus operator CRDs. These CRDs aren't perfect but they have been there for many years and I bet that our users are familiar with the format. |
71e6477 to
fcd3ab5
Compare
|
Actionable comments posted: 0 |
|
@marioferh: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
| // +optional | ||
| DurationInDays int32 `json:"durationInDays,omitempty"` | ||
| DurationInHours int32 `json:"durationInDays,omitempty"` | ||
| // sizeInGiB specifies the maximum storage size in gibibytes (GiB) that Prometheus |
There was a problem hiding this comment.
Is gibibytes a small enough unit? Is there a reasonable use case to limit this to smaller than a gibibyte?
| // +kubebuilder:validation:Maximum=365 | ||
| // +optional | ||
| DurationInDays int32 `json:"durationInDays,omitempty"` | ||
| DurationInHours int32 `json:"durationInDays,omitempty"` |
There was a problem hiding this comment.
You can't change the semantics of an existing field name either. The correct way to do this is to tombstone the durationInDays field by commenting it out entirely (and adding a notice that the field has been tombstoned) and adding the new durationInHours field to replace it.
We aren't concerned about existing instances because this API will only ever be present in a TPNU cluster, but we are concerned about potential clients that may serialize/deserialize based on the existing field name which is why we tombstone here (to make sure we never introduce a field with the same name but different serialization to prevent breaking clients). The tombstone can be removed when the API version is increased.
Remove Maximum validation and documentation for durationInDays (365)
and sizeInGiB (16384) on Prometheus Retention. We do not have enough
customer or operational data to pick defensible API-level ceilings;
arbitrary caps could block valid large-cluster configurations or
misalign with what CMO/Prometheus can actually support.